Say-as classification for alphabetic words in Japanese texts

Authors

Hisako Asano

Masaaki Nagata

Masanobu Abe

Abstract

Modern Japanese texts often include Western-sourced words written in the Roman alphabet. For example, a shopping directory in a web portal, which lists more than 8,000 shops, includes a total of 6,400 alphabetic words. As most of them are very new and idiosyncratic proper nouns, it is impractical to assume that all those alphabetic words can be registered in the word dictionary of a text-to-speech synthesis system; their pronunciations must be derived automatically. Our solution consists of two steps. Step 1 classifies each unknown alphabetic word into a say-as class (English, Japanese, French, Italian or English spell-out), which indicates how it is to be read. Step 2 derives the pronunciation using the grapheme-to-phoneme conversion rules for the assigned say-as class. This paper proposes a method of say-as classification (i.e. Step 1) that uses the Support Vector Machine. After some trial and error, we achieved 89.2% accuracy for web shop data, which we consider sufficient for practical use.
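To make Step 1 concrete, the following is a minimal sketch of say-as classification with a linear SVM. It is not the authors' implementation: the abstract does not state which features, kernel, or training procedure were used. The character n-gram features, the Pegasos-style training loop, the one-vs-rest scheme, the hyperparameters, and the five toy training words with their labels are all assumptions made for illustration.

```python
from collections import defaultdict

# The five say-as classes named in the abstract.
SAY_AS_CLASSES = ["english", "japanese", "french", "italian", "spell_out"]

def char_ngrams(word, n_max=3):
    """Character n-grams (n = 1..n_max) with word-boundary markers (assumed features)."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]

def featurize(word):
    """Sparse n-gram count vector for one word."""
    vec = defaultdict(float)
    for gram in char_ngrams(word):
        vec[gram] += 1.0
    return vec

def dot(weights, vec):
    return sum(weights.get(f, 0.0) * v for f, v in vec.items())

def train_binary_svm(samples, lam=0.01, epochs=50):
    """Linear SVM via Pegasos-style subgradient descent on the hinge loss.

    samples: list of (sparse vector, label) with label in {+1, -1}.
    lam and epochs are arbitrary illustrative values.
    """
    weights = {}
    t = 0
    for _ in range(epochs):
        for vec, y in samples:
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y * dot(weights, vec)
            shrink = 1.0 - eta * lam       # L2 regularisation step
            for f in weights:
                weights[f] *= shrink
            if margin < 1.0:               # hinge loss is active
                for f, v in vec.items():
                    weights[f] = weights.get(f, 0.0) + eta * y * v
    return weights

def train_one_vs_rest(labeled_words, classes=SAY_AS_CLASSES):
    """Train one binary SVM per say-as class."""
    data = [(featurize(w), label) for w, label in labeled_words]
    return {c: train_binary_svm([(vec, 1 if label == c else -1)
                                 for vec, label in data])
            for c in classes}

def classify(models, word):
    """Return the class whose SVM gives the highest decision score."""
    vec = featurize(word)
    return max(models, key=lambda c: dot(models[c], vec))

# Hypothetical toy training set; the labels are illustrative only.
TOY_DATA = [("shop", "english"), ("sakura", "japanese"), ("maison", "french"),
            ("gelato", "italian"), ("NTT", "spell_out")]
models = train_one_vs_rest(TOY_DATA)
print(classify(models, "sakuraya"))
```

In a full system, the predicted class would then select the grapheme-to-phoneme rule set applied in Step 2. A realistic classifier would need far more labeled words than this toy set.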

Similar resources

Long vowel detection for letter-to-sound conversion for Japanese sourced words transliterated into the alphabet

Modern Japanese texts often include Western sourced words written in the Roman alphabet. Even Japanese sourced words are sometimes transliterated into the Roman alphabet. As most of them are very new and idiosyncratic proper nouns, it is impractical to assume all those alphabetic words can be registered in the word dictionary of a text-to-speech system; their pronunciation must be derived autom...

An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...

Orthographic Reading Deficits in Dyslexic Japanese Children: Examining the Transposed-Letter Effect in the Color-Word Stroop Paradigm

In orthographic reading, the transposed-letter effect (TLE) is the perception of a transposed-letter position word such as "cholocate" as the correct word "chocolate." Although previous studies on dyslexic children using alphabetic languages have reported such orthographic reading deficits, the extent of orthographic reading impairment in dyslexic Japanese children has remained unknown. This st...

High-Performance Bilingual Text Alignment Using Statistical and Dictionary Information

This paper describes an accurate and robust text alignment system for structurally different languages. Among structurally different languages such as Japanese and English, there is a limitation on the amount of word correspondences that can be statistically acquired. The proposed method makes use of two kinds of word correspondences in aligning bilingual texts. One is a bilingual dictionary of...

Acquired Dyslexia in Three Writing Systems: Study of a Portuguese-Japanese Bilingual Aphasic Patient

The Japanese language is represented by two different codes: syllabic and logographic while Portuguese employs an alphabetic writing system. Studies on bilingual Portuguese-Japanese individuals with acquired dyslexia therefore allow an investigation of the interaction between reading strategies and characteristics of three different writing codes. The aim of this study was to examine the differ...

Publication date: 2003
